An Efficient Linear Pseudo-minimization Algorithm for Aho-Corasick Automata
Abstract
A classical construction of Aho and Corasick solves the pattern matching problem for a finite set of words X in linear time, where the size of the input X is the sum of the lengths of its elements. It produces an automaton that recognizes A*X, where A is a finite alphabet, but this automaton is generally not minimal. As an alternative to classical minimization algorithms, which yield an O(n log n) solution to the problem, we propose a linear pseudo-minimization algorithm specific to Aho-Corasick automata. It produces an automaton whose size lies between that of the input automaton and that of its associated minimal automaton. Moreover, this algorithm generically computes the minimal automaton: for a large variety of natural distributions, the probability that the output is the minimal automaton of A*X tends to one as the size of X tends to infinity.
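To make the gap between the Aho-Corasick automaton and the minimal automaton concrete, here is a minimal Python sketch. It builds the complete Aho-Corasick automaton for a small set X and counts the states of the equivalent minimal automaton. The function names (`build_ac_dfa`, `accepts`, `minimal_size`) and the example set X = {aa, ba} are illustrative assumptions, not taken from the paper. The state count uses Moore's classical partition refinement, which can be quadratic in the worst case; it is not the linear pseudo-minimization algorithm the abstract proposes.

```python
from collections import deque

def build_ac_dfa(words, alphabet):
    """Complete Aho-Corasick automaton, as a DFA recognizing A*X."""
    goto = [{}]            # trie transitions; state 0 is the root
    final = [False]
    for w in words:
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({})
                final.append(False)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        final[s] = True
    fail = [0] * len(goto)              # failure links
    delta = [dict() for _ in goto]      # complete transition function
    queue = deque()
    for c in alphabet:
        t = goto[0].get(c, 0)
        delta[0][c] = t
        if t:
            queue.append(t)
    while queue:                        # breadth-first over the trie
        s = queue.popleft()
        # A state is final if some word of X is a suffix of its label.
        final[s] = final[s] or final[fail[s]]
        for c in alphabet:
            if c in goto[s]:
                t = goto[s][c]
                fail[t] = delta[fail[s]][c]
                delta[s][c] = t
                queue.append(t)
            else:
                delta[s][c] = delta[fail[s]][c]
    return delta, final

def accepts(delta, final, text):
    s = 0
    for c in text:
        s = delta[s][c]
    return final[s]

def minimal_size(delta, final, alphabet):
    """Number of states of the minimal DFA (Moore refinement, not the paper's method)."""
    cls = [int(f) for f in final]
    while True:
        sig, new = {}, []
        for s in range(len(delta)):
            key = (cls[s],) + tuple(cls[delta[s][c]] for c in alphabet)
            new.append(sig.setdefault(key, len(sig)))
        if len(sig) == len(set(cls)):   # partition is stable
            return len(sig)
        cls = new

alphabet = "ab"
delta, final = build_ac_dfa(["aa", "ba"], alphabet)
print(len(delta), minimal_size(delta, final, alphabet))  # 5 3
```

For X = {aa, ba} the Aho-Corasick automaton has five states, while the minimal automaton of A*X has three, because the states reached by "aa" and "ba" (and likewise by "a" and "b") behave identically.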
Similar resources
EffCLiP: Efficient Coupled-Linear Packing for Finite Automata
Finite automata are widely recognized as a fundamental computing model with a broad range of applications, notably network monitoring. We propose a new approach, “efficient coupled-linear packing” (EffCLiP), that optimizes both finite-automata size and performance. EffCLiP employs a novel transition representation that enables a simple addressing operator (integer addition) while providing flex...
Text matching of strings in terms of straight line program by compressed aleshin type automata
In this paper we check whether a given text of strings, represented by a straight line program (SLP), is equivalent to a model text. For a given SLP-compressed Aleshin type automata D of size n and height h representing m patterns of total length N, we present an O(n log N)-size representation of Aho-Corasick automaton which recognizes all occurrences of the patterns in D in amortized O...
Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets
We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, P, of total length n, in O(n) time and space for integer alphabets. Processing a text of size m over an alphabet Σ with the automaton costs O(m log |Σ| + k), where there are k occurrences of patterns in the text. A new, efficient implementation of nodes in the Aho Corasick automaton is introduced, ...
Automata-Theoretic Analysis of Bit-Split Languages for Packet Scanning
Bit-splitting breaks the problem of monitoring traffic payloads to detect the occurrence of suspicious patterns into several parallel components, each of which searches for a particular bit pattern. We analyze bit-splitting as applied to Aho-Corasick style string matching. The problem can be viewed as the recovery of a special class of regular languages over product alphabets from a collection ...
Implementing the Aho-Corasick Automata for Phonetic Search
In phonetic search, the goal is to find in a text all words with the same pronunciation as the search phrase. The user writes the word down using a different alphabet and transcription rules. Mrázová et al. proposed a new method for phonetic search based on searching for all possible transcriptions with Aho-Corasick automata [8]. Their algorithm offers better precision than the previous existin...
Publication date: 2012